Section 4 Extra skills

In this section of my portfolio I will show you some of the extra skills I have developed during my data science minor.

4.1 Writing an introduction using Zotero for references

Writing research papers is a pretty important for the research field, so using references to other papers is crucial for writing a good introduction. Below here you can see an introduction to my project about liquid biopsies, it makes use of multiple references.

Neuroblastoma is the fourth most common tumor in children and presents itself as either a low-risk neuroblastoma or a high-risk neuroblastoma. Determining which risk neuroblastoma is present a tumor biopsy is necessary. (Weiser et al. 2019) however these biopsies are very invasive and can only be done a few times even tho the tumor keeps mutating. To counter this problem researchers have become more and more interested in liquid biopsies to be able to track the mutations in the tumor. This is possible because the tumor often excretes DNA into the blood stream which is then used for whole-exome sequencing (WES). these results can be compared with the DNA of the tumor to show how the tumor evolves over time (Chicard et al. 2018). because the tumor cells have different properties depending on what part of the tumor they are on it is difficult to show all the mutations based on 1 tumor biopsy as that can have a bias for only that specific part of the tumor where the biopsy was taken from. Liquid biopsies also help with this problem as they sequence all the DNA that has been excreted by the tumor, this makes it possible to spot different mutations in both tumor DNA and cell free DNA (cf-DNA). (Van Paemel et al. 2022)
The type of mutations that get the most attention are the copy number variations/aberrations (CNV/CNA) these CNV’s can either be very small with just a few kilobases or very big where they cover the whole chromosome. The CNV’s are very important as they can give an identification on how pathogenic the tumor is based on the genes it contains, the position of the CNV and the size of the CNV (Riggs et al. 2020).
Some hospitals sadly don’t have easy ways to analyse the data from all these patients because they do not have the experience with data analysis. This makes the process of analyzing the data a slow and tedious task. Even tho it would be extremely beneficial for the hospitals without data scientists to have the programs available to analyse these results quickly (Valsesia et al. 2013), this takes away the time consuming task of having to analyze every file manually which gives them time to focus on more important things like treating the patient. Sharing these results to other hospitals is equally as important because there is not nearly enough data available to say with certainty how dangerous certain tumors are and if the tumors have been fully removed. With all the data combined the research towards liquid biopsies can evolve quickly making diagnoses easier and more reliable
In this project we will analyse the tumor DNA and the cfDNA created by WES and will make it reproducible so that Princess maxima centre can easily analyse the data for all their patients.
To accomplish this we will mostly focus on:
– Giving all the CNV’s different ID’s so they can be easily distinguished from each other.
– Showing which cytogenetic band the CNV falls in to.
– Making an interactive plot to look up genes easily.
– Automatically filtering genes that indicate high risk neuroblastoma’s.
– Making a high throughput version so that it will analyse multiple datasets at the same time without having to manually insert all the data sets – Getting these results quickly and easily so the researcher does not have to focus on how to analyze the data.

4.2 Creating paramaters for different data inputs

To show my ability to use paramaters I will be using data from the ECDC. the data is available in this repository under “data/COVID_cases_31_05_2022”

# loading in data
cases <- read.csv("data/COVID_cases_31_05_2022.csv")

# filtering the params used
cases_filtered <- cases %>% dplyr::filter(countriesAndTerritories == params$country, year == params$year, month >= params$period_start, month <= params$period_end) 

# telling R the dateRep column is a date
cases_filtered$dateRep <- as.Date(cases_filtered$dateRep, format = "%d/%m/%Y")

# making a graph for cases
cases_graph <- cases_filtered %>%
  ggplot(aes(x = dateRep, y = cases)) +
  geom_point(size = .5) +
  geom_line() +
  labs(title = paste("Covid related cases from month", params$period_start, "to", params$period_end, "in", params$year, "for", params$country),
       x = "Month",
       y = "Covid related cases") +
  theme_classic()

ggplotly(cases_graph)
# making a graph for deaths
deaths_graph <- cases_filtered %>%
  ggplot(aes(x = dateRep, y = deaths)) +
  geom_point(size = .5) +
  geom_line() +
  labs(title = paste("Covid related deaths from month", params$period_start, "to", params$period_end, "in", params$year, "for", params$country),
       x = "Month",
       y = "Covid related deaths") +
  theme_classic()

ggplotly(deaths_graph)

If you want to recreate these graphs with different parameters, clone this repository and use put this command in the console (with your own params ofcourse): bookdown::render_book(params = list(country = "Netherlands", year = 2021, period_start = 5, period_end = 10))

4.3 Using SQL to store data

Databases are an important feature of data science (it is even in the name), thats why being able to request data from a database is equally as important as actually analyzing the data.

First we need to make a new database, this has been done with the use of DBeaver and the code for creating the database can be found below.

SQL script of database creation

after creating the database we need to make a connection with DBeaver.
to reproduce this:
* make a new database in dbeaver called “dengue_flu_data”.
* insert your own password in the yaml header of this file in my repository.


IMPORTANT MESSAGE: The source of the data used for these analyses is NOT available anymore, so the conclusions that have been formed based on the graphs are conclusions based on data that can not be confirmed anymore.

library(DBI)

con <- dbConnect(RPostgres::Postgres(),
                 dbname = "dengue_flu_data",
                 host = "localhost",
                 port = "5432",
                 user = "postgres",
                 password = params$password)
library(tidyverse)
library(dslabs)
library(car)

gapminder <- gapminder
dengue_data <- read_csv("data/dengue_data.csv", skip = 11)
flu_data <- read_csv("data/flu_data.csv", skip = 11)

dengue_data %>% head(5)
## # A tibble: 5 x 11
##   Date       Argentina Bolivia Brazil India Indonesia Mexico Philippines
##   <date>         <dbl>   <dbl>  <dbl> <dbl>     <dbl>  <dbl>       <dbl>
## 1 2002-12-29        NA   0.101  0.073 0.062     0.101 NA              NA
## 2 2003-01-05        NA   0.143  0.098 0.047     0.039 NA              NA
## 3 2003-01-12        NA   0.176  0.119 0.051     0.059  0.071          NA
## 4 2003-01-19        NA   0.173  0.17  0.032     0.039  0.052          NA
## 5 2003-01-26        NA   0.146  0.138 0.04      0.112  0.048          NA
## # ... with 3 more variables: Singapore <dbl>, Thailand <dbl>, Venezuela <dbl>
flu_data %>% head(5)
## # A tibble: 5 x 30
##   Date       Argentina Australia Austria Belgium Bolivia Brazil Bulgaria Canada
##   <date>         <dbl>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>    <dbl>  <dbl>
## 1 2002-12-29        NA        NA      NA      NA      NA    174       NA     NA
## 2 2003-01-05        NA        NA      NA      NA      NA    162       NA     NA
## 3 2003-01-12        NA        NA      NA      NA      NA    174       NA     NA
## 4 2003-01-19        NA        NA      NA      NA      NA    162       NA     NA
## 5 2003-01-26        NA        NA      NA      NA      NA    131       NA     NA
## # ... with 21 more variables: Chile <dbl>, France <dbl>, Germany <dbl>,
## #   Hungary <dbl>, Japan <dbl>, Mexico <dbl>, Netherlands <dbl>,
## #   `New Zealand` <dbl>, Norway <dbl>, Paraguay <dbl>, Peru <dbl>,
## #   Poland <dbl>, Romania <dbl>, Russia <dbl>, `South Africa` <dbl>,
## #   Spain <dbl>, Sweden <dbl>, Switzerland <dbl>, Ukraine <dbl>,
## #   `United States` <dbl>, Uruguay <dbl>

As we can see both the flu and dengue data are not tidy ( > 1 observations per row). This can easily be fixed with pivot_longer()

dengue_tidy <- dengue_data %>% pivot_longer(cols = Argentina:Venezuela,
                                            names_to = "country",
                                            values_to = "Dengue_cases")
dengue_tidy %>% head(5)
## # A tibble: 5 x 3
##   Date       country   Dengue_cases
##   <date>     <chr>            <dbl>
## 1 2002-12-29 Argentina       NA    
## 2 2002-12-29 Bolivia          0.101
## 3 2002-12-29 Brazil           0.073
## 4 2002-12-29 India            0.062
## 5 2002-12-29 Indonesia        0.101
flu_tidy <- flu_data %>% pivot_longer(cols = Argentina:Uruguay,
                                      names_to = "country",
                                      values_to = "Flu_cases")
flu_tidy %>% head(5)
## # A tibble: 5 x 3
##   Date       country   Flu_cases
##   <date>     <chr>         <dbl>
## 1 2002-12-29 Argentina        NA
## 2 2002-12-29 Australia        NA
## 3 2002-12-29 Austria          NA
## 4 2002-12-29 Belgium          NA
## 5 2002-12-29 Bolivia          NA
# data is now tidy

If we want to eventually merge these 3 data files together we need to make the date and country variables equal:
– Date from flu and dengue data need to be split into year, month, day.
– Country needs to be of class factor.

dengue_tidy$country <- as.factor(dengue_tidy$country) 
flu_tidy$country <- as.factor(flu_tidy$country)

dengue_tidy <- dengue_tidy %>% separate(Date, into = c("year", "month", "day"), convert = T, sep = "-")
flu_tidy <- flu_tidy %>% separate(Date, into = c("year", "month", "day"), convert = T, sep = "-")

We will store this data so we always have our tidy versions by hand.

dengue_tidy %>% write.csv("data/denguetidy.csv")
dengue_tidy %>% write_rds("data/denguetidy.rds")

flu_tidy %>% write.csv("data/flutidy.csv")
flu_tidy %>% write_rds("data/flutidy.rds")

gapminder %>% write.csv("data/gapminder.csv")
gapminder %>% write_rds("data/gapminder.rds")

Now we can insert these dataframes into SQL.

dbWriteTable(con, "dengue", dengue_tidy)
dbWriteTable(con, "flu", flu_tidy)
dbWriteTable(con, "gapminder", gapminder)

To check if everything imported correctly we will go to DBeaver and inspect the data.

The imported dengue data The imported flu data The imported gapminder data

We can also get the datasets back and check them using R

dengueSQL <- dbReadTable(con, "dengue")
fluSQL <- dbReadTable(con, "flu")
gapminderSQL <- dbReadTable(con, "gapminder")

fluSQL %>% head(5)
##   year month day   country Flu_cases
## 1 2002    12  29 Argentina        NA
## 2 2002    12  29 Australia        NA
## 3 2002    12  29   Austria        NA
## 4 2002    12  29   Belgium        NA
## 5 2002    12  29   Bolivia        NA
gapminderSQL %>% head(5)
##               country year infant_mortality life_expectancy fertility
## 1             Albania 1960           115.40           62.87      6.19
## 2             Algeria 1960           148.20           47.50      7.65
## 3              Angola 1960           208.00           35.98      7.32
## 4 Antigua and Barbuda 1960               NA           62.97      4.43
## 5           Argentina 1960            59.87           65.39      3.11
##   population          gdp continent          region
## 1    1636054           NA    Europe Southern Europe
## 2   11124892  13828152297    Africa Northern Africa
## 3    5270844           NA    Africa   Middle Africa
## 4      54681           NA  Americas       Caribbean
## 5   20619075 108322326649  Americas   South America

It seems like inserting it into DBeaver and pulling it back changes the data classes for every variable.
This is not a big problem, we just have to make sure to change the classes of certain values whenever we import something from DBeaver.
We can now combine these 3 dataframes into 1 big dataframe based on country and year.

flu_dengue_combined <- full_join(dengueSQL, fluSQL, by = c("country", "year", "month", "day"))

# lets check how many years all 3 dataframes are covering

flu_dengue_combined$year %>% min()
## [1] 2002
flu_dengue_combined$year %>% max()
## [1] 2015
# flu and dengue have a range from 2002 to 2015

gapminderSQL$year %>% min()
## [1] 1960
gapminderSQL$year %>% max()
## [1] 2016
# gapminder has a range from 1960 to 2016.
# we dont need all those years so we will cut those out

gapminder_filtered <- gapminderSQL %>% filter(year >= 2002) %>% filter(year <= 2015)

# lets update the gapminder data in SQL so it only contains this dataframe instead of the big one because it is not required

dbWriteTable(con, "gapminder", gapminder, overwrite = T)

# now we can combine all the data frames

combined_all <- full_join(flu_dengue_combined, gapminder_filtered, by = c("country", "year"))
dbWriteTable(con, "gap_flu_den", combined_all)

Now that we have a table with all the data combined we can ask SQL to extract specific data to do some analysis on.

Lets say we want to know if there is a difference in the amount of flu cases of brazil and argentina in 2007.

# getting all the relevant data

arg_bra_flucases <- dbGetQuery(con, 
                                  "SELECT 
                                      \"month\",
                                      country, 
                                      gfd.\"Flu_cases\", 
                                      population
                                   FROM 
                                      gap_flu_den gfd
                                   WHERE 
                                      country IN (\'Argentina\', \'Brazil\') AND
                                      \"year\" = 2007;")

# because brazil has a much higher population we will use the cases per 100.000 citizens.

arg_bra_flucases <- arg_bra_flucases %>% mutate(per_100.000 = Flu_cases * 100000 / population)

summary_arg_bra_flu <- arg_bra_flucases %>%
  group_by(month, country) %>%
  summarise(total = sum(per_100.000))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
summary_arg_bra_flu %>%
  ggplot(aes(x = month.name[month], y = total, colour = country)) +
  geom_point() +
  geom_path(aes(group = country)) +
  labs(title = "Flu cases per 100.000 citizens in argentina and brazil",
       subtitle = "Data from 2007",
       x = "Month of the year",
       y = "Cases per 100.000 citizens") +
    scale_x_discrete(limits = month.name) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 60, vjust = 0.7))

# on first glance Argentina seems to have more cases per 100.000 citizens, but just to make sure we will do a two sample t-test

summary_arg_bra_flu %>%
  group_by(country) %>%
  summarise(p.value.normality = shapiro.test(total)$p.value)
## # A tibble: 2 x 2
##   country   p.value.normality
##   <chr>                 <dbl>
## 1 Argentina             0.210
## 2 Brazil                0.312
# both have p > 0.05, so they are normally distributed, now we can continue with levennes test.

leveneTest(summary_arg_bra_flu$total, as.factor(summary_arg_bra_flu$country), center = mean)$P
## [1] 0.0001063121           NA
# p < 0.05, so they do not have equal variance. we can now perform the t test

t.test(formula = summary_arg_bra_flu$total ~ summary_arg_bra_flu$country,
       paired = F, var.equal = F)$p.value
## [1] 0.0003999675
# p < 0.001, There is a significant difference between the total amount of cases per 100.000 citizens in Argentina and Brazil.

This had some interesting results. lets try something different this time.
Lets see if some south-east asian countries have reduced their dengue infections between 2002 - 2015.

SEAsia_dengue_cases <- dbGetQuery(con, 
                                  "SELECT 
                                      gfd.\"year\",
                                      gfd.\"Dengue_cases\",
                                      country
                                   FROM 
                                      gap_flu_den gfd
                                   WHERE 
                                      country IN (\'Singapore\', \'Indonesia\', \'Thailand\', \'Philippines\')")

summary_SEAsia_den_cases <- 
  SEAsia_dengue_cases %>%
  group_by(year, country) %>%
  summarize(total = sum(Dengue_cases))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
summary_SEAsia_den_cases %>%
  ggplot(aes(x = year, y = total, colour = country)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(2002, 2015, by = 2)) +
  labs(title = "Total dengue cases per year",
       x = "Year",
       y = "Total cases per year") +
  theme_classic()
## Warning: Removed 9 rows containing missing values (geom_point).
## Warning: Removed 9 row(s) containing missing values (geom_path).

The dengue cases seem to stay mostly stagnant with a few peaks here and there for every country, but in 2015 all dengue cases seem to have dropped which could be a good sign.
Now lets take a look at one last graph. It is always said that the flu peak is always during winter because the cold makes it easier for the influenza virus to infect people. lets check this for ourselves by checking the flu incidence for the year 2009 in 5 northern hemisphere countries.

flu_incidence_northern <- dbGetQuery(con,
                                     "SELECT
                                        country,
                                        \"month\",
                                        gfd.\"Flu_cases\"
                                      FROM 
                                        gap_flu_den gfd
                                      WHERE 
                                        country IN (\'Netherlands\', \'Norway\', \'France\', \'Sweden\', \'Switzerland\') AND
                                        \"year\" = 2009")

summary_flu_inc_north <- flu_incidence_northern %>%
  group_by(month, country) %>%
  summarize(total = sum(Flu_cases))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
summary_flu_inc_north %>%
  ggplot(aes(x = month.name[month], y = total, group = country, fill = country)) +
  geom_col(position = position_dodge()) +
  labs(title = "Flu incidence for different countries",
       subtitle = "Data from 2009",
       x = "Month",
       y = "Flu incidence") +
  scale_x_discrete(limits = month.name) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 60, vjust = 0.8))



As we can see in this graph, most flu incidence occurs during the winter so saying that people get sick from the flu in the winter would be correct.

4.4 Creating a package

Creating a package can save people allot of time, because they don’t have to write the same piece of code multiple times.
That is why having the ability to make a package is very important as a data scientist.


Because I am obsessed with bees I decided to make my life a bit easier, that’s why I created a package that will perform my calculations for me so I can save myself some time and make my life just that extra bit easier.
You can find this package by clicking this link.

You can download the package by using these two lines.

install.packages("devtools")
devtools::install_github("thijmenvanbrenk/beecalculator")

By using help(package = "beecalculator") you will get a nice overview of the available data and functions available in this package.
To get a detailed description of what everything does use browseVignettes("beecalculator") to see all the possibilities with this package.

4.5 Creating an epidemiology map

With the infrastructure we have created to make it easier for people to connect, we have also made the infrastructure for diseases to spread. Many of these diseases cause allot of trouble and can weaken whole societies, just take a look at SARS-COV-19. To figure out how these diseases spread is essential to figure out how a pathogen behaves. Figuring this out gives us the possibility to take precautions so it does not happen again, as can be seen by this report from the CDC.
In my future I want to be able to process epidemiological data to figure out where a disease originated, how it spread to different people and make this data easily readable for non data scientists with the help of shiny.
To learn these skills I have made a small plan for a few steps I want to go through:
1. Find a suitable outbreak with available data.
2. Process the data to get simple phylogenetic trees.
3. Add regions to these trees.
4. Create a shiny to make it easy for others to create these graphs
5. Give more parameters, like showing only specific regions or adding different species. (optional depending on time)

4.5.1 Finding an outbreak

To be able and actually to create phylogenetic trees I need to have some data available to me.
I dont want a massive amount of data like from the covid pandemic because this would be too much data to process while this is only for my own learning purpose.
the NCBI has a very detailed database of different virus variation, for this exercise I will be using the data from the MERS-Cov outbreak (2013-2019).
If you want to download this data for yourself go to THIS SITE and follow the following steps:
1. Go to “MERS coronavirus” 2. Select nucleotide sequence type.
3. Select Human host.
4. Select S genome region.
5. Click “additional filters”, type “complete cds” and switch to definition lines.
6. Click “Add query”.
7. Click “Show results”. (this will redirect you to a new screen)
8. Make sure only the S genome regions are selected!
9. Click “Customize label”, make sure the label only contains “{accession}”. (this makes it easier to see later on)
10. Click “Download” with as download option “Nucleotide (FASTA)”.
11. Click “Download” but now with as download option “Result set (CSV)”. (This is the metadata)


You now have 2 files:
1. The nucleotide sequence of all samples taken from human hosts with the accession number.
2. The metadata that shows all the information for each accession number.


I will only be working with these 2 datafiles.

4.5.2 Creating Phylogenetic trees

For learning how to create these phylogenetic trees I have taken inspiration from a small tutorial on how to create Phylogenetic trees. You can find this tutorial Here.
After getting an idea of how to make these trees I have taken further inspiration from ggtree and all the help for that on google.

# now we need to load in the nucleotides
dna <- readDNAStringSet(here("data/MERS_nucleotides.fa"))
dna
## DNAStringSet object of length 168:
##       width seq                                             names               
##   [1]  4062 ATGATACACTCAGTGTTTCTAC...GCATAAGGTTCATGTTCACTAA KT805988
##   [2]  4062 ATGATACACTCAGTGTTTCTAC...GCATAAGGTTCATGTTCACTAA KT805968
##   [3]  4062 ATGATACACTCAGTGTTTCTAC...GCATAAGGTTCATGTTCACTAA KT357808
##   [4]  4062 ATGATACACTCAGTGTTTCTAC...GCATAAGGTTCATGTTCACTAA KM027279
##   [5]  4062 ATGATACACTCAGTGTTTCTAC...GCATAAGGTTCATGTTCACTAA KT805976
##   ...   ... ...
## [164]  4062 ATGATACACTCAGTGTTTCTAC...GCATAAGGTTCATGTTCACTAA KM027264
## [165]  4062 ATGATACACTCAGTGTTTCTAC...GCATAAGGTTCATGTTCACTAA MH978886
## [166]  4062 ATGATACACTCAGTGTTTCTAC...GCATAAGGTTCATGTTCACTAA KT806039
## [167]  4062 ATGATACACTCAGTGTTTCTAC...GCATAAGGTTCATGTTCACTAA KM027269
## [168]  4062 ATGATACACTCAGTGTTTCTAC...GCATAAGGTTCATGTTCACTAA KY673146
# and the metadata
metadata <- read.csv(here("data/MERS_annotation.csv"))
metadata %>% head(5) %>% knitr::kable()
accession length genome_region host country isolation collection_date release_date name
KJ782549 4062 S Homo sapiens Greece oronasopharynx 2014/04/18 2014/05/13 Middle East respiratory syndrome coronavirus strain Greece-Saudi Arabia_2014 S protein (S) gene, complete cds
KF811036 4062 S Homo sapiens Tunisia blood 2013/05/08 2014/05/19 Middle East respiratory syndrome coronavirus strain Tunisia-Qatar_2013 spike protein gene, complete cds
KM027263 4062 S Homo sapiens Saudi Arabia 2014 2014/11/12 Middle East respiratory syndrome coronavirus isolate Jeddah_C7058/KSA/2014 spike protein (S) gene, complete cds
KM027264 4062 S Homo sapiens Saudi Arabia 2014 2014/11/12 Middle East respiratory syndrome coronavirus isolate Jeddah_C7209/KSA/2014 spike protein (S) gene, complete cds
KM027265 4062 S Homo sapiens Saudi Arabia 2014 2014/11/12 Middle East respiratory syndrome coronavirus isolate Jeddah_C7311/KSA/2014 spike protein (S) gene, complete cds
# we only want the years from the metadata so lets extract those
metadata <- metadata %>% separate(collection_date, into = "collection_date", sep = 4)
metadata$collection_date <- as.numeric(metadata$collection_date)


# the DNA is loaded in nicely with the exact amount of downloaded sequences.
# there is no multiple sequence alignment necessary for these sequences, this is because they all come from the same part of the DNA.

dna <- as.DNAbin(dna) # sequence has to be the correct class

# to create a phylogenetic tree we need to calculate the distance between all the sequences
# there are many different models to choose from, and it was not easy to choose out of all the options because there are allot of upsides and allot of downsides to all the possible methods. for this time I have chosen to use the Tamaru and Nei methode (TN93). Because they take into account that its different to swap from A-G then C-T and vice versa.

dna_distance <- dist.dna(dna, model = "TN93")

# lets start by making a very simple neighbour joining phylogenetic tree.

nj_tree <- bionj(dna_distance)
plot(nj_tree, cex = .4)

# this is the most simplistic phylogenetic tree to be made and just shows the relation between the different samples.
# now lets change it up so it isnt as clutured and make it give the information we actually want

# we can also make it our tree rooted by adding the first sample ever taken

nj_rooted <- root(nj_tree, 1) %>% ladderize()

# performing a bootstrap will not only tell us how good our lines are, it also give us the option to collapse some of them

# we have to load the dna back with another method because otherwise boot.phylo cannot recognize it for some reason
dna_boots <- fasta2DNAbin(here("data/MERS_nucleotides.fa"))
## 
##  Converting FASTA alignment into a DNAbin object... 
## 
## 
##  Finding the size of a single genome... 
## 
## 
##  genome size is: 4,062 nucleotides 
## 
## ( 60  lines per genome )
## 
##  Importing sequences... 
## ..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
##  Forming final object... 
## 
## ...done.
bootstrap <- boot.phylo(nj_rooted, dna_boots, function(e) {root(bionj(dist.dna(e, model = "TN93")), 1)})
## 
Running bootstraps:       100 / 100
## Calculating bootstrap values... done.
# now we can add our bootstrap values
plot(nj_rooted, cex = .4)
axisPhylo()
nodelabels(bootstrap, cex = .7)

    # now we can collapse some of our nodes that have too low values
    nj_collapsed <- nj_rooted

    # step 1: figure out the row numbers of the branches that can be collapsed
    N <- length(nj_rooted$tip.label)
    tocollapse <- match(which(bootstrap<65)+N, nj_rooted$edge[,2])

    # step 2: get rid of those branches
    nj_collapsed$edge.length[tocollapse] <- 0
    nj_collapsed <- di2multi(nj_collapsed, tol = 0.00001)

    # step 3: check if it worked by plotting this new tree
    plot(nj_collapsed, cex = .45)

# lets add some nice colours to the plot so we can see the countries more easily

nj_tree_withcolours <- ggtree(nj_collapsed) +
                          geom_treescale()
    
    metadata$country <- as.factor(metadata$country)


nj_tree_withcolours <- nj_tree_withcolours %<+% metadata +
  geom_tiplab(aes(color = country), size = 2.5)

nj_tree_withcolours

# we can even zoom in on some parts of the tree

zoomed <- ggtree(tree_subset(nj_tree, "MK462258")) +
  geom_treescale()

zoomed %<+% metadata +
  geom_tiplab(aes(color = country), size = 2.5)

4.5.3 Making Phylogeny automated

Now that I have created something I want to be able to show easily, without all the hassle of recreating this code over and over again, I will implement this into a shiny so its more user friendly.
Because hosting a shiny in bookdown is not possible without launching it as an app I will just include the code. The original file can be found at “data/phylo.shiny.Rmd”.

ui <- fluidPage(
  
  # makes you choose your own theme (I like themes (: )
  shinythemes::themeSelector(),
  
  titlePanel("Create your own phylogenetic tree"),
  
  sidebarLayout(
    sidebarPanel(
      
      h4("Input files"),
      
      fileInput(inputId = "fasta_file",
                label = "Select a fasta file",
                accept = ".fa"),
      
      fileInput(inputId = "metadata",
                label = "Select the metadata",
                accept = ".csv"),
      
      actionButton(inputId = "go",
                   label = "Start analysis/Reset graph"),
      
      h4("Retrieval settings"),
      
      textInput(inputId = "accession",
                label = "Insert the accession number"),
      
      actionButton(inputId = "zoom",
                   label = "Zoom in on clade"),
      
      h4("Visualization"),
      
      selectInput(inputId = "colour",
                  label = "Which factor should be distinguished on",
                  choices = c("country", "host", "isolation", "collection_date"),
                  selected = "country"),
      
      width = 3
    
    ),
    
    mainPanel(
      tabsetPanel(
        tabPanel("How to download the required files",
                 tags$div(
                   tags$h4("Requirements"),
                   "This shiny requires very specific data files, so to make sure there isnt any confusion I made a list of what these files need:",
                   tags$br(),
                   "1. All sequences have to be the same length.",
                   tags$br(),
                   "2. Metadata must contain \"accession\", \"host\", \"country\", \"isolation\" and \"collection_date\".",
                   tags$br(), tags$br(),
                   tags$h4("Downloading files"),
                   "1. Go to the site of ",
                   tags$a(href="https://www.ncbi.nlm.nih.gov/genome/viruses/variation/", "the NCBI."),
                   tags$br(),
                   "2. Go to the virus databank you want to make the tree of.",
                   tags$br(),
                   "3. Select nucleotide sequence type.",
                   tags$br(),
                   "4. Select your filters. (tip: select a region that has the full cds)",
                   tags$br(),
                   "5. Click \"Add query\" and \"Show results\".",
                   tags$br(),
                   "6. Make sure you only select the same region with the same length!!",
                   tags$br(),
                   "7.  Click \"Customize label\", make sure the label only contains \"{accession}\". (this makes it easier to see later on) ",
                   tags$br(),
                   "8. Download the \"Nucleotide (fasta)\" and \"Result set (CSV)\" options.",
                   tags$br(),
                   "9. Input these two files in this shiny and go to the next tab.",
                   tags$br(),tags$br(),
                   tags$h4("Extra information"),
                   "Creating a phylogeny tree is not easy and requires allot of factors for it to go right, thats why this shiny only has the basics. Later on I am planning on adding \"Multiple Sequence Alignment\", but I do not have time for that at this moment.",
                   tags$br(),
                   "If there are any questions feel free to contact me through",
                   tags$a(href="https://github.com/thijmenvanbrenk", "my github"
                   ))),
        
        tabPanel("Phylogenetic tree",
                 tableOutput("meta"),
                 plotOutput("phylo"))
        
        
      
          ),
        
        
          width = 9)
    
  )
)

This is what the ui looks like.

server <- function(input, output, session) {
  
  # inputting the fasta file
  
  nucleotides <- reactive({
  
    req(input$fasta_file)
    
    readDNAStringSet(input$fasta_file$datapath)
  
  })
  # inputting the metadata file
  
  metadata <- reactive({
    
    req(input$metadata)
    
    read.csv(input$metadata$datapath)
  })
  # extracting just the years from metadata
  
  metadata_year_temp <- reactive({
    
    metadata() %>% separate(collection_date, into = "collection_date", sep = 4)
  })
  # making inputted tree
  
  nj_tree <- reactive({
    
    as.DNAbin(nucleotides()) %>%
      dist.dna(model = "TN93") %>%
      bionj()
  })

  # time to make the graph
  
  # first create te full graph
  tree_foundation <- reactive({
    
    ggtree(nj_tree()) +
      geom_treescale()
  
    })
  
  tree_full <- reactive({
    
    tree_foundation() %<+% metadata_year_temp() +
       geom_tiplab(aes_string(color = input$colour), size = 3)
  })
  
  # then the zoomed in graph
  tree_zoomed <- reactive({
   ggtree(tree_subset(nj_tree(), input$accession)) +
  geom_treescale()
  })
  
  tree_zoomed_full <- reactive({
    
    tree_zoomed() %<+% metadata_year_temp() +
  geom_tiplab(aes_string(color = input$colour), size = 3)
  })
  
  # now tell shiny which to show when
  
  tree_output <- reactiveValues(plot=NULL)
  
  observeEvent(input$go, {
  
    tree_output$plot <- tree_full()
  })
  
  observeEvent(input$zoom, {
    
    tree_output$plot <- tree_zoomed_full()
  })
  
  # now we can output the plot
  output$phylo <- renderPlot({
  
    if(is.null(tree_output$plot)) {
      
    } else {
    tree_output$plot
    }
    })
  
  # show the information about the selected accession code
  output$meta <- renderTable({
    
    output_table <- metadata_year_temp() %>% 
      filter(metadata_year_temp()$accession == input$accession)
    
    if(nrow(output_table) == 0) {
      validate("This accession number does not exist")
    } else {
      output_table
    }
    
  })
  
}

shinyApp(ui, server, options = list(height=650, width = 1300))

And these are pictures of the server in action, it can create graphs like these.

References

Chicard, Mathieu, Leo Colmet-Daage, Nathalie Clement, Adrien Danzon, Mylène Bohec, Virginie Bernard, Sylvain Baulande, et al. 2018. “Whole-Exome Sequencing of Cell-Free DNA Reveals Temporo-spatial Heterogeneity and Identifies Treatment-Resistant Clones in Neuroblastoma.” Clinical Cancer Research 24 (4): 939–49. https://doi.org/10.1158/1078-0432.CCR-17-1586.
Riggs, Erin Rooney, Erica F. Andersen, Athena M. Cherry, Sibel Kantarci, Hutton Kearney, Ankita Patel, Gordana Raca, et al. 2020. “Technical Standards for the Interpretation and Reporting of Constitutional Copy-Number Variants: A Joint Consensus Recommendation of the American College of Medical Genetics and Genomics (ACMG) and the Clinical Genome Resource (ClinGen).” Genetics in Medicine: Official Journal of the American College of Medical Genetics 22 (2): 245–57. https://doi.org/10.1038/s41436-019-0686-8.
Valsesia, Armand, Aurélien Macé, Sébastien Jacquemont, Jacques S. Beckmann, and Zoltán Kutalik. 2013. “The Growing Importance of CNVs: New Insights for Detection and Clinical Interpretation.” Frontiers in Genetics 4: 92. https://doi.org/10.3389/fgene.2013.00092.
Van Paemel, Ruben, Charlotte Vandeputte, Lennart Raman, Jolien Van Thorre, Leen Willems, Jo Van Dorpe, Malaïka Van Der Linden, et al. 2022. “The Feasibility of Using Liquid Biopsies as a Complementary Assay for Copy Number Aberration Profiling in Routinely Collected Paediatric Cancer Patient Samples.” European Journal of Cancer (Oxford, England: 1990) 160 (January): 12–23. https://doi.org/10.1016/j.ejca.2021.09.022.
Weiser, Daniel A., Diana C. West-Szymanski, Ellen Fraint, Shani Weiner, Marco A. Rivas, Carolyn Wei Ting Zhao, Chuan He, and Mark A. Applebaum. 2019. “Progress Toward Liquid Biopsies in Pediatric Solid Tumors.” Cancer Metastasis Reviews 38 (4): 553–71. https://doi.org/10.1007/s10555-019-09825-1.